Abstract

Genotyping errors are known to influence the power of both family-based and case-control 
studies in the genetics of complex disease. Estimating genotyping error rate in a given dataset 
can be complex, but when family information is available error rates can be inferred from 
the patterns of Mendelian inheritance between parents and offspring. I introduce a novel 
likelihood-based method for calculating error rates from family data, given known allele
frequencies. I apply this to an example dataset, demonstrating a low genotyping error rate in
genotyping data from a personal genomics company. 

1 Introduction 

High-throughput genotyping and sequencing technologies allow affordable genetic studies of a 
very large number of loci. However, both of these methods have the potential to wrongly infer 
the genotype of an individual at a particular site. It has been shown that the rate at which these 
errors occur (the genotyping error rate) can strongly influence the power of a linkage study [T], 
and moderately influence the power of a case-control study [6]. Estimating error rates is thus an 
important component of power calculations, as well as in assessing the quality of a dataset. 

A popular method of assessing genotyping error rate is to use family relationships between 
the samples within a dataset. In particular, the rate of impossible inheritance patterns under 
Mendelian inheritance (the Mendelian error rate) is a commonly used metric. The popular
software package PLINK [T] has functions to calculate Mendelian error rates in trios per site or per
trio. However, the relationship between Mendelian error rates and genotyping error rates is not 
straightforward, and depends on the allele frequency spectrum of the sites being considered. 
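
To make the distinction between Mendelian errors and genotyping errors concrete, here is a minimal sketch of a Mendelian consistency check for biallelic sites. It assumes genotypes are coded as 0, 1 or 2 copies of one allele; the function names are hypothetical and are not taken from PLINK.

```python
def is_mendelian_error(offspring, mother, father):
    """True if the offspring genotype is impossible given the parental genotypes.

    Genotypes are coded as the number of copies (0, 1 or 2) of one allele.
    """
    # Alleles each parent can transmit: 0 = other allele, 1 = counted allele.
    gametes = {0: {0}, 1: {0, 1}, 2: {1}}
    possible = {a + b for a in gametes[mother] for b in gametes[father]}
    return offspring not in possible


def mendelian_error_rate(trios):
    """Fraction of (offspring, mother, father) genotype triples that are inconsistent."""
    errors = sum(is_mendelian_error(o, m, p) for o, m, p in trios)
    return errors / len(trios)
```

When both parents are heterozygous, every offspring genotype is consistent, so a genotyping error at such a site can never show up as a Mendelian error. This is one reason the two rates differ and why the relationship depends on allele frequencies.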

Various methods have been developed to deduce genotyping error rates from Mendelian error 
rates. Hao et al. [5] empirically measured calibration curves between Mendelian errors and
genotyping errors, and Saunders et al. [5] derived an expression for Mendelian error rates in terms of
genotyping error rates. However, neither of these methods provides a robust likelihood calculation
for observed genotypes for a given error rate. In addition, neither method takes into account the
behaviour of the large number of sites that do not contain Mendelian errors, but can nonetheless
still contain information about error rates when coupled with allele frequency measures.

I derive the full likelihood of observed trio genotypes given a genotyping error rate and allele 
frequencies, and given simplifying assumptions. This can be used to perform maximum likelihood 
inference of genotyping error rates across sites within a trio or across trios within a site. I apply 
this method to a trio genotyped by the personal genomics company 23andMe, demonstrating that 
the trio has a low genotype error rate. 

2 Method 

Here we derive the likelihood of observing a given set of genotypes X across N sites in a single trio,
though this can equally be applied to the genotypes of N trios at a single site. 
We will denote the site $j \in 1..N$, and the individuals in the trio $i \in \{o, m, p\}$ for offspring,
maternal and paternal. We will denote the (unknown) true genotypes $G^j = \{g^j_o, g^j_m, g^j_p\}$, and the
observed genotypes $X^j = \{x^j_o, x^j_m, x^j_p\}$. We will assume a random per-chromosome error rate of
$e$ (the per-diploid genotype error rate is thus approximately $2e$), and a variant-specific genotype
frequency $f^j$. The joint likelihood of true genotype and observed genotype is thus

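$$P(X^j, G^j \mid f^j, e) = P(x^j_o \mid g^j_o, e)\, P(x^j_m \mid g^j_m, e)\, P(x^j_p \mid g^j_p, e)\, P(g^j_o \mid g^j_m, g^j_p)\, P(g^j_m \mid f^j)\, P(g^j_p \mid f^j) \tag{1}$$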

where $P(g^j_o \mid g^j_m, g^j_p)$ is given by Mendelian transmission and $P(g^j_i \mid f^j)$ is the frequency of
genotype $g^j_i$ under Hardy-Weinberg equilibrium. $P(x^j_i \mid g^j_i, e)$ is the error function, with $(1-e)^2$ for no
error, $(1-e)e$ for a heterozygous-to-homozygous or homozygous-to-heterozygous error, and $e^2$ for
a homozygous-to-homozygous error. Note that for heterozygotes there is also a probability of
a "double error" of $e^2$ that doesn't actually change the genotype.

As the true genotype is unknown, we must sum over all possible genotypes: 

$$P(X^j \mid f^j, e) = \sum_{G^j} P(X^j, G^j \mid f^j, e) \tag{2}$$

The overall likelihood for all sites is thus equal to:

$$P(X \mid f, e) = \prod_{j=1}^{N} \sum_{G^j} P(X^j, G^j \mid f^j, e) \tag{3}$$
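
The full likelihood can be sketched directly by enumerating all 27 possible true trio genotypes at each site. The code below is an illustration under stated assumptions, not the author's implementation: genotypes are coded as 0, 1 or 2 copies of the allele with frequency f, each trio is an (offspring, mother, father) tuple, and all function names are hypothetical. The error function follows the values given in the text; note that, as stated there, these values sum to one over observed genotypes only when the true genotype is heterozygous.

```python
import math
from itertools import product


def hwe(g, f):
    """Hardy-Weinberg probability of genotype g given allele frequency f."""
    return ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)[g]


def transmit(o, m, p):
    """Mendelian probability of offspring genotype o given parents m and p."""
    tm, tp = m / 2, p / 2  # chance each parent passes on the counted allele
    return ((1 - tm) * (1 - tp), tm * (1 - tp) + (1 - tm) * tp, tm * tp)[o]


def error_prob(x, g, e):
    """P(x | g, e) as described in the text, for per-chromosome error rate e."""
    if x == g:
        # No error; a heterozygote may also suffer a "double error"
        # that leaves the genotype unchanged.
        return (1 - e) ** 2 + (e ** 2 if g == 1 else 0.0)
    if abs(x - g) == 1:
        return (1 - e) * e  # heterozygous <-> homozygous
    return e ** 2           # homozygous -> opposite homozygous


def site_likelihood(X, f, e):
    """Equation (2): sum the joint likelihood over all true trio genotypes."""
    total = 0.0
    for G in product(range(3), repeat=3):
        o, m, p = G
        p_obs = math.prod(error_prob(x, g, e) for x, g in zip(X, G))
        total += p_obs * transmit(o, m, p) * hwe(m, f) * hwe(p, f)
    return total


def log_likelihood(sites, e):
    """Log of equation (3); sites is a list of (X, f) pairs."""
    return sum(math.log(site_likelihood(X, f, e)) for X, f in sites)
```

Every call to `log_likelihood` repeats the 27-term sum at every site, which is the cost that motivates the simplification that follows.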

However, this is computationally expensive to calculate repeatedly. To simplify the calculation, 
and allow easy recalculation after changing e, we can partition out the joint likelihood into terms 
that do and do not contain e 

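$$P(X^j, G^j \mid f^j, e) = P(X^j \mid G^j, e)\, P(G^j \mid f^j) \tag{4}$$

where

$$P(X^j \mid G^j, e) = P(x^j_o \mid g^j_o, e)\, P(x^j_m \mid g^j_m, e)\, P(x^j_p \mid g^j_p, e) \tag{5}$$
$$P(G^j \mid f^j) = P(g^j_o \mid g^j_m, g^j_p)\, P(g^j_m \mid f^j)\, P(g^j_p \mid f^j) \tag{6}$$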

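The overall likelihood can thus be written:

$$P(X \mid f, e) = \prod_{j=1}^{N} \left[ P(X^j \mid G^j = X^j, e)\, P(G^j = X^j \mid f^j) + \sum_{G^j \neq X^j} P(X^j \mid G^j, e)\, P(G^j \mid f^j) \right] \tag{7}$$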



To simplify this equation, we will assume that only one error occurs (i.e. that all terms in $e^2$ or
greater are negligible). Thus for the first term, where no errors occur, $P(X^j = G^j \mid G^j, e) \approx (1-e)^6$.
For genotypes in which one error occurs $P(X^j \mid G^j, e) \approx e(1-e)^5$, and for all others $P(X^j \mid G^j, e) \approx 0$.
The likelihood then becomes:

$$P(X \mid f, e) = (1-e)^{5N} \prod_{j=1}^{N} \left[ (1-e)\, P(G^j = X^j \mid f^j) + e \sum_{\text{one mutation}} P(G^j \mid f^j) \right] \tag{8}$$
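
The prefactor in equation (8) can be checked by substituting the two approximations into the term for a single site j in equation (7). The following is a short worked step, using the same symbols as the surrounding equations:

```latex
\begin{align*}
P(X^j \mid f^j, e)
  &\approx (1-e)^6\, P(G^j = X^j \mid f^j)
     + e(1-e)^5 \sum_{\text{one mutation}} P(G^j \mid f^j) \\
  &= (1-e)^5 \Big[ (1-e)\, P(G^j = X^j \mid f^j)
     + e \sum_{\text{one mutation}} P(G^j \mid f^j) \Big]
\end{align*}
```

Taking the product over the N sites collects the common factor into $(1-e)^{5N}$.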



The $P(G^j = X^j \mid f^j)$ and $\sum P(G^j \mid f^j)$ terms only need to be calculated once, leaving a
relatively easy likelihood calculation on which to perform maximum likelihood estimation. Point estimates and
confidence intervals can then be calculated in the usual way.
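
As an illustration of how the simplified likelihood might be used, the sketch below precomputes the two e-independent terms for each site and then maximises the log of equation (8) over a grid of error rates. Several choices here are assumptions rather than part of the method as described: the 0/1/2 genotype coding, reading "one mutation" as a true trio genotype that differs from the observed one by a single allele in a single individual, the grid search, the likelihood-ratio interval (a drop of 1.92 log units for 95%), and the exclusion of sites that would need two or more errors. All function names are hypothetical.

```python
import math


def hwe(g, f):
    """Hardy-Weinberg probability of genotype g given allele frequency f."""
    return ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)[g]


def transmit(o, m, p):
    """Mendelian probability of offspring genotype o given parents m and p."""
    tm, tp = m / 2, p / 2  # chance each parent passes on the counted allele
    return ((1 - tm) * (1 - tp), tm * (1 - tp) + (1 - tm) * tp, tm * tp)[o]


def prior(G, f):
    """P(G | f) for a trio G = (offspring, mother, father), as in equation (6)."""
    o, m, p = G
    return transmit(o, m, p) * hwe(m, f) * hwe(p, f)


def precompute(X, f):
    """Return the two e-independent terms of equation (8) for one site.

    X must be a tuple (offspring, mother, father).
    """
    exact = prior(X, f)
    single = 0.0
    for i in range(3):        # which individual carries the error
        for step in (-1, 1):  # true genotype one allele away from the observed one
            g = X[i] + step
            if 0 <= g <= 2:
                single += prior(X[:i] + (g,) + X[i + 1:], f)
    return exact, single


def log_likelihood(e, terms):
    """Log of equation (8), given precomputed (exact, single) pairs."""
    n = len(terms)
    return 5 * n * math.log1p(-e) + sum(
        math.log((1 - e) * a + e * b) for a, b in terms
    )


def estimate(terms, lo=1e-6, hi=1e-1, points=400):
    """Grid-search estimate of e with an approximate 95% interval."""
    # Assumption: sites that need two or more errors have zero approximate
    # likelihood, so they are dropped here rather than taking log(0).
    terms = [t for t in terms if t[0] > 0 or t[1] > 0]
    grid = [lo * (hi / lo) ** (k / (points - 1)) for k in range(points)]
    ll = [log_likelihood(e, terms) for e in grid]
    best = max(ll)
    inside = [e for e, v in zip(grid, ll) if v >= best - 1.92]
    return grid[ll.index(best)], inside[0], inside[-1]
```

Because `precompute` is called once per site, re-evaluating the likelihood for a new value of e costs one logarithm per site. The interval is truncated at the grid limits, so the limits should be widened if an endpoint coincides with `lo` or `hi`.
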
3 Application



I applied this method to a trio of individuals genotyped by the personal genomics company 
23andMe. This company provides consumers with genotyping and interpretation of their
genetic data, while using the data generated to perform case-control association studies for mapping
human traits. The company has discovered novel genetic associations [4] [3], and replicated many 
more[9]. Low genotyping error rates are critical for both aims, as genotype errors can reduce the 
power of the case-control analyses, and give consumers false information. 

I used a trio of individuals genotyped by 23andMe, along with allele frequencies from the 
60 CEU individuals of the HapMap 3 dataset[^. This allowed assessment of 715 566 variants. 
The error rate was estimated to be 8.5 x 10~^, with a 95% confidence interval of 6.8-10.2 x 10~^,
suggesting that genotype errors are rare in this dataset. 